How to kill inventors: testing the Massacrator© algorithm for inventor disambiguation

Pezzoni, Michele; Lissoni, Francesco; Tarasconi, Gianluca

doi:10.1007/s11192-014-1375-7

Published: 29 July 2014

Volume 101, pages 477–504, (2014)
Scientometrics

Michele Pezzoni^1,3,
Francesco Lissoni^2,3 &
Gianluca Tarasconi^3

Abstract

Inventor disambiguation is an increasingly important issue for users of patent data. We propose and test a number of refinements to the Massacrator algorithm, originally proposed by Lissoni et al. (The Keins database on academic inventors: methodology and contents, 2006) and now applied to APE-INV, a free-access database funded by the European Science Foundation. Following Raffo and Lhuillery (Res Policy 38:1617–1627, 2009) we describe disambiguation as a three-step process: cleaning & parsing, matching, and filtering. By means of sensitivity analysis, based on Monte Carlo simulations, we show how various filtering criteria can be manipulated in order to obtain optimal combinations of precision and recall (type I and type II errors). We also show how these different combinations generate different results for applications to studies on inventors’ productivity, mobility, and networking, and discuss data quality problems related to linguistic issues. The filtering criteria based upon information on inventors’ addresses are sensitive to data quality, while those based upon information on co-inventorship networks are always effective. Details on data access and data quality improvement via feedback collection are also discussed.

Notes

Access information for PatStat at: http://forums.epo.org/epo-worldwide-patent-statistical-database/ - last visited: 6/27/2014.
For the definition of “precision” and “recall”, see section Cleaning & Parsing.
See the post “Converting patstat text fields into plain ascii” on the RawPatentData blog (http://rawpatentdata.blogspot.com/2010/05/converting-patstat-text-fields-into.html ; last access: March, 2014).
As an example, consider token "ABCABC" as t1 and token "ABCD" as t2. The bigram sets for t1 and t2 will be respectively: (AB,BC,CA,AB,BC) and (AB,BC,CD). Applying Equation 1 returns:
$$ 2G(t1, t2) = \frac{\sqrt{(2 - 1)_{AB}^{2} + (2 - 1)_{BC}^{2} + (1 - 0)_{CA}^{2} + (1 - 0)_{CD}^{2}}}{5 + 3} = \frac{2}{8} = 0.25 $$
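The worked example in this note can be reproduced with a short Python sketch. Equation 1 is not reproduced in this section, so the formula below is an assumption inferred from the example: the Euclidean distance between the bigram counts of the two tokens, divided by the total number of bigrams. The function names `bigrams` and `two_gram_distance` are hypothetical.

```python
from collections import Counter
from math import sqrt


def bigrams(token: str) -> Counter:
    """Count the overlapping two-character substrings of a token."""
    return Counter(token[i:i + 2] for i in range(len(token) - 1))


def two_gram_distance(t1: str, t2: str) -> float:
    """Bigram distance between two tokens.

    Assumed form (inferred from the worked example, not taken from
    Equation 1 itself): square root of the summed squared differences
    in bigram counts, divided by the total number of bigrams.
    """
    c1, c2 = bigrams(t1), bigrams(t2)
    total = sum(c1.values()) + sum(c2.values())
    if total == 0:
        return 0.0  # both tokens are too short to contain a bigram
    squared = sum((c1[g] - c2[g]) ** 2 for g in set(c1) | set(c2))
    return sqrt(squared) / total


print(two_gram_distance("ABCABC", "ABCD"))  # 0.25
```

Under this reading, identical tokens have distance 0, and the value grows as the two bigram sets diverge.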
For a definition of patent family, see Martinez (2011).
Huang et al.'s original formula was proposed to compare inventors with no more than one patent each. We have adapted it to the case of inventors with multiple patents.
More precisely, NAFA and NAE contain matches between an inventor and one of his/her patents, and another inventor and one of his/her patents, plus information on whether the two inventors are the same person, according to information collected manually. Having been hand-checked, the matches in the benchmark databases are expected to contain neither false positives nor false negatives. Notice that both NAFA and NAE are based upon the PatStat October 2009 release. A detailed description is available online (Lissoni et al. 2010).
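As a minimal sketch of how an algorithm's output could be scored against such a hand-checked benchmark, the snippet below uses the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The paper's own definitions are given in a section not reproduced here, so this is an assumption. The record identifiers and the helper `pair` are hypothetical.

```python
def pair(a: str, b: str) -> frozenset:
    """An unordered match between two inventor-patent records."""
    return frozenset((a, b))


def precision_recall(predicted: set, benchmark_same: set) -> tuple:
    """Score predicted positive matches against benchmark positive matches."""
    tp = len(predicted & benchmark_same)   # matches confirmed by the benchmark
    fp = len(predicted - benchmark_same)   # matches the benchmark rejects
    fn = len(benchmark_same - predicted)   # benchmark matches that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical records, for illustration only.
benchmark = {pair("r1", "r2"), pair("r3", "r4"), pair("r5", "r6"), pair("r7", "r8")}
predicted = {pair("r1", "r2"), pair("r4", "r3"), pair("r1", "r9")}

precision, recall = precision_recall(predicted, benchmark)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Representing each match as a `frozenset` makes the comparison independent of the order in which the two records are listed.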
The NAFA and NAE frontiers include not only the most extreme points, but are extended to include all outcomes with precision and recall values higher than $\text{Precision}(\bar{o}) - 0.02$ and $\text{Recall}(\bar{o}) - 0.02$ for any $\bar{o}$. This will prove useful for the ensuing statistical exercise.
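The extension of the frontier can be sketched as follows. This assumes that each outcome is a (precision, recall) pair, that the "most extreme points" are the outcomes not dominated on both measures, and that $\bar{o}$ ranges over those points; none of these assumptions is confirmed by the note itself. The outcome values are invented for illustration.

```python
TOLERANCE = 0.02  # the margin stated in the note


def pareto_frontier(outcomes: list) -> list:
    """Outcomes that no other outcome matches or beats on both measures."""
    return [
        o for o in outcomes
        if not any(p != o and p[0] >= o[0] and p[1] >= o[1] for p in outcomes)
    ]


def extended_frontier(outcomes: list, tol: float = TOLERANCE) -> list:
    """Outcomes within `tol` of some frontier point on both measures."""
    frontier = pareto_frontier(outcomes)
    return [
        o for o in outcomes
        if any(o[0] > f[0] - tol and o[1] > f[1] - tol for f in frontier)
    ]


# Hypothetical (precision, recall) outcomes of simulated parametrizations.
outcomes = [(0.90, 0.60), (0.80, 0.80), (0.60, 0.90), (0.89, 0.59), (0.70, 0.70)]
print(pareto_frontier(outcomes))
print(extended_frontier(outcomes))
```

In this example the outcome (0.89, 0.59) is dominated by (0.90, 0.60) but lies within the margin, so it joins the extended frontier, while (0.70, 0.70) does not.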
Remember that $W_{\omega}^{k}$ is a random variable with expected value equal to 0.5. By definition, any sample with a different mean cannot be randomly drawn, and must be considered either over- or under-represented by comparison to a random distribution. In case the estimated impact of a criterion is not significantly different from zero for recall, but positive for precision, then it is desirable to include it in any parametrization, as it increases precision at no cost in terms of recall. Conversely, any filter with zero impact on precision, but a significantly negative impact on recall, ought to be excluded from any parametrization, as it bears a cost in terms of the latter, and no gains in terms of precision. We have conducted this type of analysis, and found it helpful to understand the relative importance of the different filtering criteria. We do not report it for reasons of space, but it is available on request.
Regression analysis can be applied to the same set of results in order to estimate the marginal impact of each filtering criterion and of the Threshold on either precision or recall, other things being equal. In general, we expect all filters to bear a negative influence on recall (in that they increase the number of negative matches, both true and false), and a positive influence on precision (they eliminate false positives).
The figures presented here are the result of further adjustments we introduced in order to solve transitivity problems. Transitivity problems may emerge for any triplet of inventors (such as I, J, and Z) whenever two distinct pairs are recognized to be the same person (e.g. I & J and J & Z), but the same does not apply to the remaining pair (I & Z are not matched, or are considered negative matches). In this case we need to decide whether to revise the status of I & Z (and consider the two inventors as the same person as J) or the status of the other pairs (and consider either I or Z as different persons than J). When confronting this problem, we always opted for considering the two inventors the same person, so that I, J and Z are the same individual according to Massacrator.
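The rule of always merging can be read as taking the transitive closure of the positive matches. The sketch below does this with a union-find structure; this is one common way to compute such a closure, not necessarily how Massacrator implements it. The function name `merge_transitively` and the record labels are hypothetical.

```python
def merge_transitively(records: list, positive_pairs: list) -> list:
    """Group records so that any chain of positive matches ends up in one group."""
    parent = {r: r for r in records}

    def find(x):
        # Walk up to the representative of x, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in positive_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for r in records:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())


# I & J and J & Z are positive matches; I & Z was never matched directly.
groups = merge_transitively(["I", "J", "Z", "K"], [("I", "J"), ("J", "Z")])
print(groups)
```

Here I, J and Z end up in a single group even though the pair I & Z was not a positive match, while K stays on its own.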
Fields of chemistry and pharmaceuticals are defined as in Schmoch (2008). We consider only these fields, and years from 2000 and 2005, for ease of computation. Co-inventorship is intended as a connection between two inventors having (at least) one patent in common.
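The definition of co-inventorship given here translates directly into an edge list. The following sketch assumes a mapping from patent identifiers to disambiguated inventor identifiers; the name `co_inventorship_edges` and the sample data are hypothetical.

```python
from itertools import combinations


def co_inventorship_edges(patent_inventors: dict) -> set:
    """Undirected edges between inventors sharing at least one patent."""
    edges = set()
    for inventors in patent_inventors.values():
        # Sorting gives each pair a canonical order, so (A, B) and (B, A)
        # are stored once; set() drops repeated names on the same patent.
        for a, b in combinations(sorted(set(inventors)), 2):
            edges.add((a, b))
    return edges


# Hypothetical patents and inventor identifiers.
patents = {"P1": ["A", "B", "C"], "P2": ["B", "C"], "P3": ["D"]}
print(sorted(co_inventorship_edges(patents)))
```

Because the edges are built from inventor identifiers, any change in how records are merged during disambiguation changes the resulting network.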
On immigration of inventors, see Miguelez and Fink (2013) and Breschi et al. (2014). Both papers provide information on ongoing attempts to classify inventors according to their nationality and/or country of origin (country of birth, or of parents’ or grandparents’ birth). In the near future, it will be possible to use such information to refine new versions of Massacrator (see Conclusions).

References

Agrawal, A., Cockburn, I., & McHale, J. (2006). Gone but not forgotten: knowledge flows, labor mobility, and enduring social relationships. Journal of Economic Geography, 6(5), 571.
Azoulay, P., Ding, W., & Stuart, T. (2009). The impact of academic patenting on the rate, quality and direction of (public) research output. The Journal of Industrial Economics, 57, 637–676.
Balconi, M., Breschi, S., & Lissoni, F. (2004). Networks of inventors and the role of academia: an exploration of Italian patent data. Research Policy, 33(1), 127–145.
Barrai, I., Rodriguez-Larralde, A., Mamolini, E., & Scapoli, C. (1999). Isonymy and isolation by distance in Italy. Human Biology, 71, 947–961.
Bilenko, M., Kamath, B., & Mooney, R.J. (2006). Adaptive blocking: Learning to scale up record linkage, In Data Mining, 2006. ICDM’06. Sixth International Conference on. IEEE, pp. 87–96.
Borgatti, S. P., Mehra, A., Brass, D. J., & Labianca, G. (2009). Network analysis in the social sciences. Science, 323(5916), 892–895.
Breschi, S., & Lissoni, F. (2005). Knowledge networks from patent data. In H. F. Moed, W. Glänzel & U. Schmoch (Eds.), Handbook of Quantitative Science and Research. Amsterdam: Springer.
Breschi, S., & Lissoni, F. (2009). Mobility of skilled workers and co-invention networks: an anatomy of localized knowledge flows. Journal of Economic Geography.
Breschi, S., Lissoni, F., & Montobbio, F. (2008). University patenting and scientific productivity: a quantitative study of Italian academic inventors. European Management Review, 5(2), 91–109.
Breschi, S., Lissoni, F., & Tarasconi, G. (2014). Inventor Data for Research on Migration & Innovation: a Survey and a Pilot. WIPO Economic Research Working Paper. N.17, World Intellectual Property Organization, Geneva.
Burt, R. S. (1987). Social contagion and innovation: cohesion versus structural equivalence. American Journal of Sociology, 1287–1335.
Carayol, N., & Cassi, L. (2009). Who’s Who in Patents. A Bayesian approach. Cahiers du GREThA, 7, 07–2009.
Den Besten, M., Lissoni, F., Maurino, A., Pezzoni, M., & Tarasconi, G. (2012). APE-INV Data Dissemination and Users’ Feedback Project. Mimeo (http://www.academicpatentig.eu).
Fleming, L., King, C., & Juda, A. I. (2007). Small Worlds and Regional Innovation. Organization Science, 18, 938–954.
Freeman, L. C. (1979). Centrality in social networks conceptual clarification. Social Networks, 1(3), 215–239.
Griliches, Z. (1990). Patent statistics as economic indicators: A survey. Journal of Economic Literature, 28(4), 1661–1707.
Huang, H., & Walsh, J. P. (2011). A new name-matching approach for searching patent inventors. mimeo.
Li, G.C., Lai, R., D’Amour, A., Doolin, D.M., Sun, Y., Torvik, V.I., Yu, A.Z., & Fleming, L. (2014). Disambiguation and co-authorship networks of the US patent inventor database. Research Policy, 43(6), 941–955.
Lissoni, F., Coffano, M., Maurino, A., Pezzoni, M., & Tarasconi, G. (2010). APE-INV’s Name Game Algorithm Challenge: A Guideline for Benchmark Data Analysis & Reporting. mimeo.
Lissoni, F., Llerena, P., McKelvey, M., & Sanditov, B. (2008). Academic patenting in Europe: new evidence from the KEINS database. Research Evaluation, 17(2), 87–102.
Lissoni, F., Pezzoni, M., Poti, B., & Romagnosi, S. (2013). University Autonomy, the Professor Privilege and Academic Patenting: Italy, 1996–1997. Industry and Innovation, 20(5), 399–421.
Lissoni, F., Sanditov, B., & Tarasconi, G. (2006). The Keins database on academic inventors: methodology and contents. WP cespri, 181.
Marx, M., Strumsky, D., & Fleming, L. (2009). Mobility, skills, and the Michigan non-compete experiment. Management Science, 55(6), 875–889.
Maurino, A., & Li, P. (2012). Deduplication of large personal database. Mimeo.
Miguelez, E., & Fink, C. (2013). Measuring the International Mobility of Inventors: A New Database, WIPO Economic Research Working Paper N.8, World Intellectual Property Organization, Geneva.
Nagaoka, S., Motohashi, K., & Goto, A. (2010). Patent statistics as an innovation indicator. Handbook of the Economics of Innovation, 2, 1083–1127.
On, B.-W., Lee, D., Kang, J., & Mitra, P. (2005). Comparative study of name disambiguation problem using a scalable blocking-based framework, In: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM, pp. 344–353.
Raffo, J., & Lhuillery, S. (2009). How to play the name game: patent retrieval comparing different heuristics. Research Policy, 38(10), 1617–1627.
Schmoch, U. (2008). Concept of a technology classification for country comparisons. Final report to the World Intellectual Property Organization (WIPO), Fraunhofer Institute for Systems and Innovation Research, Karlsruhe.
Smalheiser, N. R., & Torvik, V. I. (2009). Author name disambiguation. Annual review of information science and technology, 4(1), 31–43.
Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. doi:10.1145/1552303.1552304.
Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. doi:10.1002/asi.20105.
Yasuda, N. (1983). Studies of isonymy and inbreeding in Japan. Human Biology, 263–276.

Acknowledgements

This paper derives from research undertaken with the support of APE-INV, the Research Networking Programme on Academic Patenting in Europe, funded by the European Science Foundation. Early drafts benefitted from comments by participants in the APE-INV NameGame workshop series. We are also grateful to Nicolas Carayol, Lorenzo Cassi, Stephan Lhuillery and Julio Raffo for providing us with core data for the two benchmark datasets. Monica Coffano and Ernest Miguelez provided extremely valuable research assistance. Andrea Maurino’s expertise on data quality has been extremely helpful.

Author information

Authors and Affiliations

CEMI, École Polytechnique Fédérale De Lausanne, Odyssea, 1015, Lausanne, Switzerland
Michele Pezzoni
GRETHA UMR 5113, Université de Bordeaux, Avenue Léon Duguit, 33608, Pessac cedex, France
Francesco Lissoni
CRIOS, Università Bocconi, Via G. Roentgen 1, 20136, Milan, Italy
Michele Pezzoni, Francesco Lissoni & Gianluca Tarasconi

Corresponding author

Correspondence to Michele Pezzoni.

About this article

Cite this article

Pezzoni, M., Lissoni, F. & Tarasconi, G. How to kill inventors: testing the Massacrator© algorithm for inventor disambiguation. Scientometrics 101, 477–504 (2014). https://doi.org/10.1007/s11192-014-1375-7

Received: 25 September 2013
Published: 29 July 2014
Issue Date: October 2014
DOI: https://doi.org/10.1007/s11192-014-1375-7
